Identity and search in social networks 
Social networks have the surprising property of being "searchable": ordinary people are capable
of directing messages through their network of acquaintances to reach a specific but distant target 
person in only a few steps. We present a model that offers an explanation of social network searcha- 
bility in terms of recognizable personal identities defined along a number of social dimensions. Our 
model defines a class of searchable networks and a method for searching them that may be applica- 
ble to many network search problems including the location of data files in peer-to-peer networks, 
pages on the World Wide Web, and information in distributed databases. 
In the late 1960s, Travers and Milgram conducted an experiment in which
randomly selected individuals in Boston, Massachusetts, and Omaha, Nebraska,
were asked to direct letters to a target person in Boston, each forwarding
his or her letter to a single acquaintance whom they judged to be closer
than themselves to the target. Subsequent recipients did the same. The
average length of the resulting acquaintance chains for the letters that
eventually reached the target (roughly 20%) was approximately six. This
reveals not only that short paths exist between individuals in a large
social network but that ordinary people can find these short paths. This is
not a trivial statement, since people rarely have more than local knowledge
about the network. People know who their friends are. They may also know who
some of their friends' friends are. But no one knows the identities of the
entire chain of individuals between themselves and an arbitrary target.

The property of being able to find a target quickly, 
which we call searchability, has been shown to exist in 
certain specific classes of networks that either possess a 
certain fraction of hubs (highly connected nodes which, once reached, can
distribute messages to all parts of the network) or are built upon an
underlying geometric lattice which acts as a proxy for "social space".
Neither of these network types, however, is a satisfactory 
model of society. 

In this paper, we present a model for a social network 
that is based upon plausible social structures and offers 
an explanation for the phenomenon of searchability. Our 
model follows naturally from six contentions about social 
networks. 

1. Individuals in social networks are endowed not only with network ties,
but identities: sets of characteristics which they attribute to themselves
and others by virtue of their association with, and participation in, social
groups. The term group refers to any collection of individuals with which
some well-defined set of social characteristics is associated.

FIG. 1: (A) Individuals (dots) belong to groups (ellipses) which in turn
belong to groups of groups and so on, giving rise to a hierarchical
categorization scheme. In this example, groups are composed of g = 6
individuals and the hierarchy has l = 4 levels with a branching ratio of
b = 2. Individuals in the same group are considered to be a distance x = 1
apart and the maximum separation of two individuals is x = l. The example
individuals i and j belong to a category two levels above that of their
respective groups and the distance between them is x_ij = 3. Individuals each
have z friends in the model and are more likely to be connected with each
other the closer their groups are. (B) The complete model has many
hierarchies indexed by h = 1 ... H, and the combined social distance y_ij
between nodes i and j is taken to be the minimum ultrametric distance over
all hierarchies, y_ij = min_h x^h_ij. The simple example shown here for H = 2
demonstrates that social distance can violate the triangle inequality:
y_ij = 1 since i and j belong to the same group under the first hierarchy and
similarly y_jk = 1, but i and k remain distant in both hierarchies, giving
y_ik = 4 > y_ij + y_jk = 2.

2. Individuals break down, or cluster, the world hier- 
archically into a series of layers, where the top layer 
accounts for the entire world and each successively deeper 
layer represents a cognitive division into a greater number 
of increasingly specific groups. In principle, this process 
of distinction by division can be pursued all the way down 
to the level of individuals, at which point each person is 
uniquely associated with his or her own group. For purposes of
identification, however, people do not typically do this, instead terminating
the process at the level where the corresponding group size g becomes
cognitively manageable. Academic departments, for example, are sometimes
small enough to function as a single group, but tend to split into
specialized sub-groups as they grow larger. A reasonable upper bound on group
size is g ≈ 100, a number which we incorporate into our model (Fig. 1A). We
define the similarity x_ij between individuals i and j as the height of their
lowest common ancestor level in the resulting hierarchy, setting x_ij = 1 if
i and j belong to the same group. The hierarchy is fully characterized by
depth l and constant branching ratio b. The hierarchy is
a purely cognitive construct for measuring social distance 
and not an actual network. The real network of social 
connections is constructed as follows. 

3. Group membership, in addition to defining individual identity, is a
primary basis for social interaction, and therefore acquaintanceship. As
such, the probability of acquaintance between individuals i and j decreases
with decreasing similarity of the groups to which they respectively belong.
We model this by choosing an individual i at random and a link distance x
with probability p(x) = c exp(-αx), where α is a tunable parameter and c is
a normalizing constant. We then choose a second node j uniformly among all
nodes that are distance x from i, repeating this process until we have
constructed a network in which individuals have an average number of friends
z. The parameter α is therefore a measure of homophily, the tendency of like
to associate with like. When e^(-α) ≪ 1, all links will be as short as
possible, and individuals will only connect to those most similar to
themselves (i.e., members of their own bottom-level group), yielding a
completely homophilous world of isolated cliques. By contrast, when
e^(-α) = b, any individual is equally likely to interact with any other,
yielding a uniform random graph [12] in which the notion of individual
similarity or dissimilarity has become irrelevant.

4. Individuals hierarchically cluster the social world 
in more than one way (for example, by geography and 
by occupation). We assume that these categories are 
independent, in the sense that proximity in one does 
not imply proximity in another. For example, two peo- 
ple may live in the same town but not share the same 
profession. In our model, we represent each such social 
dimension by an independently partitioned hierarchy. A 
node's identity is then defined as an H-dimensional coordinate vector v_i,
where v_i^h is the position of node i in the hth hierarchy, or dimension.
Each node i is randomly assigned a coordinate in each of H dimensions, and is
then allocated neighbors (friends) as described above, where now it randomly
chooses a dimension h (e.g., occupation) to use for each tie. When H = 1 and
e^(-α) ≪ 1, the density of network ties must obey the constraint z < g.

5. Based on their perceived similarity with other nodes, individuals
construct a measure of "social distance" y_ij, which we define as the minimum
ultrametric distance over all dimensions between two nodes i and j; i.e.,
y_ij = min_h x^h_ij. This minimum metric captures the intuitive notion that
closeness in only a single dimension is sufficient to connote affiliation
(for example, geographically and ethnically distant researchers who
collaborate on the same project). A consequence of this minimal metric,
depicted in Fig. 1B, is that social distance violates the triangle inequality
(hence it is not a true metric distance), because individuals i and j can be
close in dimension h_1, and individuals j and k can be close in dimension
h_2, yet i and k can be far apart in both dimensions.

6. Individuals forward a message to a single neighbor 
given only local information about the network. Here, 
we suppose that each node i knows only its own coordinate vector v_i, the
coordinate vectors v_j of its immediate network neighbors, and the coordinate
vector v_t of a given target individual, but is otherwise ignorant of the
identities or network ties of nodes beyond its immediate 
circle of acquaintances. 
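
The distance measures and the link-distance distribution introduced in the
contentions above can be sketched in a few lines of Python. This is a minimal
illustration, not the authors' implementation: the function names, the
convention of numbering bottom-level groups 0, 1, 2, ... from left to right
(so that integer division by the branching ratio gives the parent category),
the name alpha for the homophily parameter, and the example coordinates are
all assumptions made here.

```python
import math


def hierarchy_distance(group_a: int, group_b: int, b: int) -> int:
    """Ultrametric distance between two bottom-level groups in one hierarchy.

    Assumption: groups are numbered left to right, so group // b is the
    index of the parent category one level up.
    """
    x = 1  # members of the same group are a distance 1 apart
    while group_a != group_b:
        group_a //= b  # climb one level in the hierarchy
        group_b //= b
        x += 1
    return x


def social_distance(v_a, v_b, b: int) -> int:
    """Minimum hierarchy distance over all dimensions of two identity vectors."""
    return min(hierarchy_distance(p, q, b) for p, q in zip(v_a, v_b))


def link_distance_probs(alpha: float, depth: int):
    """Probabilities p(x) = c * exp(-alpha * x) for x = 1 .. depth."""
    weights = [math.exp(-alpha * x) for x in range(1, depth + 1)]
    c = 1.0 / sum(weights)  # normalizing constant
    return [c * w for w in weights]


# Hypothetical identities for H = 2 hierarchies with b = 2 and depth 4
# (eight bottom-level groups per hierarchy).
v_i, v_j, v_k = (0, 0), (0, 7), (7, 7)
y_ij = social_distance(v_i, v_j, 2)  # 1: same group in the first hierarchy
y_jk = social_distance(v_j, v_k, 2)  # 1: same group in the second hierarchy
y_ik = social_distance(v_i, v_k, 2)  # 4: distant in both hierarchies
print(y_ij, y_jk, y_ik, y_ik > y_ij + y_jk)  # 1 1 4 True
print(link_distance_probs(1.0, 4))
```

With these example coordinates, both pairs (i, j) and (j, k) are at social
distance 1 while i and k are at distance 4, which reproduces the violation of
the triangle inequality described for Fig. 1B.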

Individuals therefore have two kinds of partial informa- 
tion: social distance, which can be measured globally but 
which is not a true distance and hence can yield mislead- 
ing estimates; and network paths, which generate true 
distances but which are known only locally. Although 
neither kind of information alone is sufficient to perform 
efficient searches, here we show that a simple algorithm 
that combines knowledge of network ties and social identity can succeed in
directing messages efficiently. The algorithm we implement is the same greedy
algorithm Milgram suggested: each member i of a message chain forwards the
message to its neighbor j who is perceived to be closest to the target t in
terms of social distance; that is, y_jt is minimized over all j in i's
network neighborhood.
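
The greedy forwarding rule can be sketched as follows. This is a hedged
illustration under several assumptions that the text does not specify: the
network is given as a dictionary of neighbor lists, ties in social distance
are broken by list order, a holder who knows the target personally delivers
directly, each forwarding attempt fails independently with probability
p_fail, and a max_steps cap stops chains that would otherwise cycle. All
names (greedy_search, neighbors, identity) are hypothetical.

```python
import random


def hierarchy_distance(group_a, group_b, b):
    """Distance in one hierarchy: 1 for the same group, +1 per level climbed."""
    x = 1
    while group_a != group_b:
        group_a //= b
        group_b //= b
        x += 1
    return x


def social_distance(v_a, v_b, b):
    """Minimum hierarchy distance over all dimensions."""
    return min(hierarchy_distance(p, q, b) for p, q in zip(v_a, v_b))


def greedy_search(neighbors, identity, source, target, b,
                  p_fail=0.0, max_steps=100, rng=None):
    """Return the chain length if the message arrives, otherwise None."""
    rng = rng or random.Random(0)
    current, steps = source, 0
    while current != target:
        if steps >= max_steps or not neighbors[current]:
            return None  # chain abandoned (assumed cap, or dead end)
        if rng.random() < p_fail:
            return None  # message lost to attrition at this step
        if target in neighbors[current]:
            current = target  # assumed: a direct acquaintance delivers
        else:
            # Forward to the neighbor socially closest to the target,
            # using only the identities of the neighbors and the target.
            current = min(
                neighbors[current],
                key=lambda j: social_distance(identity[j], identity[target], b),
            )
        steps += 1
    return steps


# Toy network: one hierarchy (H = 1), b = 2, eight groups, four nodes in a line.
identity = {0: (0,), 1: (4,), 2: (6,), 3: (7,)}
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(greedy_search(neighbors, identity, source=0, target=3, b=2))  # 3
```

Each holder consults only its own neighbor list and the target's identity, so
the routine uses exactly the local information described above.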

Our principal objective is to determine the conditions under which the
average length ⟨L⟩ of a message chain connecting a randomly selected sender s
to a random target t is small. Although the term small has recently been
taken to mean that ⟨L⟩ grows slowly with the population size N, Travers and
Milgram found only that chain lengths were short. Furthermore, these message
chains had to be short in an absolute sense because at each step, they were
observed to terminate with probability p ≈ 0.25. We therefore adopt a more
realistic, functional notion of efficient search, defining, for a given
message failure probability p, a searchable network as any network for which
q, the probability of an arbitrary message chain reaching its target, is at
least a fixed value r. In terms of chain length, we formally require
q = ⟨(1 - p)^L⟩ ≥ r, and from this we can obtain an estimate of the maximum
required ⟨L⟩ using the approximate inequality ⟨L⟩ ≤ ln r / ln(1 - p). For the
purposes of this paper, we set r = 0.05 and p = 0.25, giving the stringent
requirement that ⟨L⟩ ≤ 10.4 independent of the population size N. Fig. 2A
presents a typical phase diagram in H and α outlining the searchable network
region for several choices of N, g ≈ 100, and z = g - 1 = 99.

FIG. 2: (A) Regions in H-α space where searchable networks exist for varying
numbers of individual nodes N (probability of message failure p = 0.25,
branching ratio b = 2, group size g = 100, average degree z = g - 1 = 99,
10^? chains sampled per network). The searchability criterion is that the
probability of message completion q must be at least r = 0.05. The lines
correspond to boundaries of the searchable network region for N = 102400
(solid), N = 204800 (dot-dash), and N = 409600 (dash). The region of
searchable networks shrinks with N, vanishing at a finite value of N which
depends on the model parameters. Note that z < g is required to explore H-α
space since for H = 1 and α sufficiently large, an individual's neighbors
must all be contained within their sole local group. (B) Probability of
message completion q(H) when α = 0 (squares) and α = 2 (circles) for the
N = 102400 data set used in A. The horizontal line shows the position of the
threshold r = 0.05. Open symbols indicate the network is searchable (q ≥ r)
and closed symbols mean otherwise. For α = 0, searchability degrades with
each additional hierarchy. For the homophilous case of α = 2 with a single
hierarchy, less than one percent of all searches find their target
(q ≈ 0.004). Adding just one other hierarchy increases the success rate to
q ≈ 0.144 and q slowly decreases with H thereafter.
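
The bound on chain length implied by the searchability criterion is simple
arithmetic, and a short check makes the numbers concrete. The sketch below
uses the same approximation as the text, replacing the average of (1 - p)^L
by (1 - p) raised to the average chain length; the function names are
illustrative only.

```python
import math


def max_mean_chain_length(r: float, p: float) -> float:
    """Largest mean chain length with (1 - p)**L >= r, i.e. ln r / ln(1 - p)."""
    return math.log(r) / math.log(1.0 - p)


def completion_probability(mean_length: float, p: float) -> float:
    """Approximate completion probability for a chain of the given mean length."""
    return (1.0 - p) ** mean_length


bound = max_mean_chain_length(r=0.05, p=0.25)
print(round(bound, 1))                                # 10.4
print(round(completion_probability(bound, 0.25), 3))  # 0.05
```

Because ln(1 - p) is negative, dividing by it reverses the inequality, which
is why a lower bound on q becomes an upper bound on the chain length.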

Our main result is that searchable networks occupy a broad region of
parameter space (α, H) which, as we argue below, corresponds to choices of
the model parameters that are the most sociologically plausible. Hence our
model suggests that searchability is a generic property of real-world social
networks. We support this claim with some further observations, and
demonstrate that our model can account for Milgram's experimental findings.

First, we observe that almost all searchable networks display α > 0 and
H > 1, consistent with the notion that individuals are essentially
homophilous (that is, they associate preferentially with like individuals),
but judge similarity along more than one social dimension. Neither the
precise degree to which they are homophilous, nor the exact number of
dimensions they choose to use, appears to be important: almost any reasonable
choice will do. The best performance, over the largest interval of α, is
achieved for H = 2 or 3, an interesting result in light of empirical evidence
[16] that individuals across different cultures in small-world experiments
typically utilize two or three dimensions when forwarding a message.

Second, as Fig. 2B shows, while increasing the number of independent
dimensions from H = 1 yields a dramatic reduction in delivery time for values
of α > 0, this improvement is gradually lost as H is increased further. Hence
the window of searchable networks in Fig. 2A exhibits an upper boundary in H.
Because ties associated with any one dimension are allocated independently
with respect to ties in any other dimension, and because for fixed average
degree z, larger H necessarily implies fewer ties per dimension, the network
ties become less correlated as H increases. In the limit of large H, the
network becomes essentially a random graph (regardless of α) and the search
algorithm becomes a random walk. Effective decentralized search therefore
requires a balance (albeit a highly forgiving one) of categorical flexibility
and constraint.

Finally, by introducing parameter choices that are consistent with Milgram's
experiment (N = 10^8, p = 0.25) as well as with subsequent empirical findings
(z = 300, H ≈ 2), we can compare the distribution of chain lengths in our
model with that of Travers and Milgram for plausible values of α and b. As
Fig. 3 shows, we obtain ⟨L⟩ ≈ 6.7 for α = 1 and b = 10, indicating that our
model captures the essence of the real small-world problem.

Although sociological in origin, our model is relevant to a broad class of
decentralized search problems, such as peer-to-peer networking, in which
centralized servers are excluded either by design or by necessity, and where
broadcast-type searches (i.e., forwarding messages to all neighbors rather
than just one) are ruled out due to congestion constraints. In essence, our
model applies to any data structure in which data elements exhibit
quantifiable characteristics analogous to our notion of identity, and
similarity between two elements (whether people, music files, web pages, or
research reports) can be judged along more than one dimension. One of the
principal difficulties with designing robust databases is the absence of a
unique classification scheme which all users of the database can apply
consistently to place and locate files. Two songs, for example, can be
similar because they belong to the same genre or because they were created
in the same year. Our model transforms this difficulty into an asset,
allowing all such classification schemes to exist simultaneously, and
connecting data elements preferentially to similar elements in multiple
dimensions. Efficient decentralized searches can then be conducted using
simple, greedy algorithms, provided only that the characteristics of the
target element and the current element's immediate neighbors are known.
FIG. 3: Comparison between n(L), the number of completed chains of length L,
taken from the original small-world experiment (bar graph) and from an
example of our model with N = 10^8 individuals (filled circles, with the line
being a guide for the eye). The experimental data shown are for the 42
completed chains that originated in Nebraska. (We have excluded the 24
completed chains that originated in Boston as this would correspond to
N ≈ 10^?.) The model parameters are H = 2, α = 1, b = 10, g = 100, and
z = 300; message attrition rate is set at 25%; n(L) for the model is compiled
from 10^? random chains and is normalized to match the 42 completed chains
that started in Nebraska. The average chain length of Milgram's experiment is
approximately 6.5 while the model yields ⟨L⟩ ≈ 6.7. The distributions compare
well: a two-sided Kolmogorov-Smirnov test yields a p-value P ≈ 0.57, while
for a χ² test, χ² = 5.46 and P ≈ 0.49 (seven bins). (A large value of P
supports the hypothesis that the distributions are similar.) Even without
attrition, the model's average search time is ⟨L⟩ ≈ 8.5 and the median chain
length is 8. The model does not entirely match the experimental data since
the former requires approximately 360 initial chains to achieve 42
completions as compared to 196.
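
The χ² p-value and the completion rates quoted in this caption can be checked
with a few lines of arithmetic. The sketch below assumes that seven bins
correspond to six degrees of freedom (bins minus one), which the caption does
not state explicitly, and uses the closed-form tail probability of the χ²
distribution for an even number of degrees of freedom; the function name is
illustrative.

```python
import math


def chi2_sf_even_dof(x: float, dof: int) -> float:
    """Upper-tail probability P(chi2 > x) for an even number of degrees of freedom.

    For dof = 2m the tail is exp(-x/2) * sum_{i=0}^{m-1} (x/2)**i / i!.
    """
    if dof <= 0 or dof % 2 != 0:
        raise ValueError("this closed form needs a positive even dof")
    half = x / 2.0
    term = 1.0   # the i = 0 term of the sum
    total = 1.0
    for i in range(1, dof // 2):
        term *= half / i  # builds (x/2)**i / i! incrementally
        total += term
    return math.exp(-half) * total


# Assumed: seven bins -> 7 - 1 = 6 degrees of freedom.
p_value = chi2_sf_even_dof(5.46, 6)
print(round(p_value, 2))  # 0.49

# Fraction of initial chains that complete, model versus experiment.
print(round(42 / 360, 3), round(42 / 196, 3))  # 0.117 0.214
```

Under that assumption the computed tail probability agrees with the quoted
P ≈ 0.49, and the two completion fractions make explicit the mismatch noted
at the end of the caption.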